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ABSTRACT 

Variance is a popular and often necessary component of 
aggregation queries. It is typically used as a secondary 
measure to ascertain statistical properties of the result 
such as its error. Yet, it is more expensive to compute 
than primary measures such as SUM, MEAN, and COUNT. 

There exist numerous techniques to compute variance. 
While the definition of variance implies two passes over 
the data, other mathematical formulations lead to a single¬ 
pass computation. Some single-pass formulations, how¬ 
ever, can suffer from severe precision loss, especially for 
large datasets. 

In this paper, we study variance implementations in 
various real-world systems and hnd that major database 
systems such as PostgreSQL 9.4 and most likely Sys¬ 
tem X, a major commercial closed-source database, use 
a representation that is efficient, but suffers from floating 
point precision loss resulting from catastrophic cancel¬ 
lation. We review literature over the past five decades on 
variance calculation in both the statistics and database 
communities, and summarize recommendations on im¬ 
plementing variance functions in various settings, such 
as approximate query processing and large-scale distributed 
aggregation. Interestingly, we recommend using the math¬ 
ematical formula for computing variance if two passes 
over the data are acceptable due to its precision, paral- 
lelizability, and surprisingly computation speed. 

1. INTRODUCTION 

New large-scale distributed data management and 
analytics systems are being developed at a rapid 
pace, with the scalability aspect of computation be¬ 
ing their predominant development focus (except¬ 
ing [^). Comparatively lesser efforts have been ex¬ 
pended on ensuring numerical correctness and sta¬ 
bility of algorithms. While such an approach can 
result in the queries being answered more quickly, 
it can also cause the computations to have a higher 
level of numerical imprecision. 

The concern of achieving numerical stability and 
precision is pertinent in numerous computational 



Figure 1: Effect of Variance Error on T-Test 
Confidence Intervals: As the magnitude of 
data values increases (x-axis, true margin of 
error is kept consistent for each dataset), the 
mean is expected to increase, and the size 
of error bars is expected to stay the same. 
However, PostgreSQL 9.3 and System X er¬ 
ror bars (a = 0.05) vary widely, while System 
Y has correct error bars. (100 data points 
are generated from a Uniform{0, 1) distri¬ 
bution and shifted using additive shifts of 
IQShift Exponent different values of Shift Ex¬ 
ponent. A detailed analysis is provided in 
Section 

scenarios; it is especially important in variance cal¬ 
culation, which has an ubiquitous presence in large- 
scale analytics and is known to suffer from precision 
issues [^. Variance is an important aggregate func¬ 
tion and an essential tool in sampling-based aggre¬ 
gation queries. Typically used as a secondary mea¬ 
sure, it augments measures such as AVERAGE and 
provides an insight into the distribution of the data 
beyond the primary measure. Computation of vari¬ 
ance, however, is susceptible to precision loss when 
the variance is much smaller than the mean |^. 

There exist several techniques to compute vari¬ 
ance. The standard variance formula uses two passes 
to provide an accurate estimate (Two Pass). Other 
techniques using a single pass over data store basic 
statistics such as count, sum, and sum of squares, 
due to common perception of Two Pass being more 







expensive due to needing two passes. One such for¬ 
mula, although fast, is known to suffer from preci¬ 
sion loss [Textbook One Pass) due to catastrophic 
cancellation [^, an undesirable effect of a floating 
point operation that causes the relative error to far 
exceed the absolute error. Figure demonstrates 
this problem. As a side note, this problem has been 
noted to affect calculators as well [^. 

Another formula ( Updating), which has been rec¬ 
ommended by Knuth , has found a strong foothold 
in the database community, with numerous imple¬ 
mentations citing Knuth in their documentation. 
However, this formula is constrained by the fact 
that it can only incorporate a single data point into 
the current running estimates. It is unable to com¬ 
bine the estimates from different subsets of data. 

Given the rise of large-scale data processing, mas¬ 
sive multi-core support and availability of GPUs, 
it is prudent to consider using representations that 
can be combined at a larger scale instead of in¬ 
crementally incorporating a single data point, such 
as Pairwise Updating. Further, Pairwise Updating 
is also known to have a better precision, as shown 
by Chan et al. for single precision input (and, as 
verified in Section]^ for double precision as well.) 

Contributions &: Outline: 

• We analyze source code for various open source 
database systems to catalog usage of different vari¬ 
ance formulas (Table [^. 

• We experiment with different closed source and 
open source databases to investigate precision loss 
issues. We find that precision of PostgreSQL and 
System X deteriorates the most. After looking at 
the PostgreSQL source code, we can verify that it 
uses Textbook One Pass, and hypothesize that Sys¬ 
tem X does so as well (or uses a similar variant). 

• We empirically study the accuracy of the dif¬ 
ferent representations under varying additive shifts 
and dataset sizes including a hitherto unstudied one, 
which we call Total Variance. 

• We recommend using Two Pass if performing 
two passes over data is acceptable (Section]^, which 
seems counter-intuitive, but works due to its com¬ 
putational simplicity. 

In the next subsection, we look at the adverse 
effects of imprecise variance calculation. Section 
presents the different variance representations and 
their properties. We then detail the representations 
used by modern databases in Section Section]^ 
lists our analysis of the behavior of the different for¬ 
mulas (double precision input compared with single 
precision in Ghan et al. [^). Finally, we conclude 


with our recommendations for variance representa¬ 
tion in current environments. 

1.1 Impact of Variance Calculations 

Due to the pervasive use of variance, a loss of pre¬ 
cision can have an impact in a variety of different 
domains. In the following paragraphs, we look at 
some use cases where the lack of precision in vari¬ 
ance calculation can have adverse consequences. 

Incorrect Output: It is possible to experimen¬ 
tally observe the loss of precision as incorrect out¬ 
put. In order to illustrate the pitfalls in using Text¬ 
book One Pass, data points were generated from a 
Uniform{0, 1) distribution and shifted by Exponent 

for Shift Exponent varying from 1 to 14. The vari¬ 
ance obtained by using a shift exponent should be 
expected to be similar to the one without any shift. 

We verify this by adding and subtracting the shift 
exponent and note that the variance of the resultant 
dataset was close to the true sample variance. 



Figure 2: Effect on PostgreSQL and System 
X: The confidence interval length (a = 0.05, 
COUNT = 100), which is derived from vari¬ 
ance, instead of being nearly constant, be¬ 
haves irrationally due to Textbook One Pass. 
The corresponding PostgreSQL query can 
be given by SELECT ti-fX stddevCcolumn) / 
sqrt(count(column)) FROM Table. 

Figure shows that PostgreSQL 9.3 and System 
X suffer from variance calculations being suscepti¬ 
ble to precision loss since variance should approxi¬ 
mately stay the same. We know that PostgreSQL 
uses Textbook One Pass and the pattern of the erro¬ 
neous calculations displayed by both of them hints 
towards System X using it as well. 

In contrast, other database systems suffered mi¬ 
nor precision loss, as expected (these results are not 
shown since they do not add any additional informa¬ 
tion to the figure). It should be noted that System 
Y was found to be highly immune to precision loss. 

Visualization: Erroneous variance calculation, how¬ 
ever small, can have a notable impact on visualiza¬ 
tions. As a demonstration, we show the results of a 






repetition of the above experiment in Figure and 
depict the sample mean and the confidence interval. 
Due to precision loss, we observe inaccurate results 
for higher shift values for PostgreSQL 9.3 and Sys¬ 
tem X. While the error bars should be similar, they 
instead vary widely and inaccurately. Error bars for 
System Y are correctly low throughout. 

Negative Variance: It is possible for variance to 
be negative while using Textbook One Pass - a the¬ 
oretically impossible result (Table [^. We observed 
in the PostgreSQL source code that variance is set 
to zero, if negative. Figure[^shows numerous values 
of 0 (i.e., missing error bars) for PostgreSQL (shift 
exponent 8, 9, and 12) and also for System X (shift 
exponents 10 and 11), providing evidence of System 
X employing a similar strategy for handling nega¬ 
tive variance values and using Textbook One Pass. 

Decision support systems: As a building block 
in popular algorithms, flaws in variance implemen¬ 
tations can have far-reaching impacts, e.g., in hy¬ 
pothesis testing, which is an integral part of nu¬ 
merous decision support systems. Having imprecise 
or incorrect variance estimates can greatly change 
the result of hypothesis testing. 

Loud Failure: Consider the case of 1 sample 2 tailed 
t-test with the shift exponent of 8 using the output 
of PostgreSQL as given in Figure Let the null 
hypothesis be as follows: 

iLo- M = 10® + 0.483594 (sample mean) 

And the alternate hypothesis as: 

Ha: 10® -f 0.483594 

The t-statistic can be given by where /r is 

the hypothesized mean estimate, x is the sample 
mean, s is the sample standard deviation and n is 
the sample size. In this case, since s is 0, the t-test 
will fail by reporting an error. 

Silent Failure: We now look at the more harmful 
error of silent failures. Let us consider the sample 
with shift exponent of 12 and use the output of 
System X. Again, let the hypotheses be as follows: 

Ho: IX = X 

Ha: fx^X 

Here, X is the hypothesized population mean. 
Let a (confidence level) be 0.05 with the resultant 
critical value of 1.98. The t-statistic will be 2634 

instead of o%° 278 ~'^ - variance calculation 

were correct, the range of X for the hypothesis 
testing to not reject it would have been [10^^ -I- 
0.475,10^^ -I- 0.585], which is small compared to 
the now permissible [10^^ — 5137.03,10^^4-5138.08]. 
Thus, we can see that for a large range of X, the 
null hypothesis will end up not being rejected 
without the user any wiser. 


Data Mining: Variance is an important tool in 
statistical analysis and machine learning algorithms 
such as Gaussian Naive Bayes, or Mixture of Gaus- 
sians based algorithms such as background model¬ 
ing, clustering, or topic modeling. For example, we 
found usage of Textbook One Pass within a graphics 
library of the R language . Similarly, MAD lib 
was also found to have a call to the PostgreSQL 
variance function: thus, an erroneous calculation of 
variance can extend from the underlying databases 
to the systems built on top of them. 


2. DIFFERENT WAYS TO 
CALCULATE VARIANCE 

Table presents the common variance represen¬ 
tations [^. We use a similar naming convention to 
that used by Chan et al. [^. S stands for the sum of 
squares. The sample variance can be given by jfzi ; 
where N is the sample size. Xi is the data point. 
X is the sample mean. Mm,n is the mean of the 
data points from indexes m to n (both inclusive). 
T.ai,n is the total of the data points from indexes m 
to n (both inclusive). We have also described Total 
Variance, for which we could not find a reference. 
In its formula, n^, mi, and Vi represent the count, 
mean, and variance respectively, of the group. 

Textbook One Pass can be computationah^ dan¬ 
gerous as the quantities 

can nearly cancel each other out. The Pairwise Up¬ 
dating formula hierarchically combines pairs of vari¬ 
ance values and uses 0{log{N)) storage while reduc¬ 
ing the relative errors from 0{N) to 0{log{N)) [^. 
Updating-YC represents Youngs and Cramer for¬ 


mula 15 and is essentially identical to Updating 
Pairwise when m = 1 or n = 1. The Updating- 
WWH formula refers to the nearly identical for¬ 
mulas used by Welford et al. 


West et al. 14 


and Hanson et al. and has similar precision as 
Updating-YC. We have used the Updating-WWH 
representation for updates using a single data point, 
and denote it by Updating. Shifting the data by an 
exact or approximate value of x (Shifted One Pass) 
can also result in substantial accuracy gains |^. 


2.1 Total Variance 

Since this is the first paper to introduce the To¬ 
tal Variance representation, we explain its steps in 
more details below. In the first pass, which is over 
the individual tuples, the variance (using one of 
the other formulas), mean, and count, of individual 
groups are computed. The second pass, over the 
groups thus formed, finds the overall mean of the 
data. In the third pass, over the groups, the overall 
variance is then found. Since the second and third 






Name 

Formula 

Accuracy 

Passes 

Storage 

Parallel 

Two Pass 

-5'= 

/V 

/ 

2 

0 (1) 

/ 

Textbook One Pass 


X 

1 

0 (1) 

/ 

Shifted One Pass 


Varies 

1 

0 (1) 

/ 

Pairwise Updating 

^l,m+n — ^l,m ^m+l,m+n 

*5^1,m+n Si^rn 

m ( n rp rp \2 

n(m-l-n) ^m+l,m+n) 

/ 

1 

0(log(N)) 

/ 

Updating-YC 

Tij = Ti j_i Xj 

Si,3 = -S'lj-i -)- (ja:j ri,j)2 

/ 

1 

0 (1) 

X 

Updating-WWH 
(Updating) 

Ml,, = + 

Si,3 = SlJ-l + {j - l)x 

(x,-Mi,,i)x(^^^^) 

/ 

1 

0 (1) 

X 

Total Variance 

S = n^irrii - 

i]?;r^(n,-i)u. 

/ 

3 

Varies 

Varies 


Table 1: Commonly used Formulas for Variance 


passes are over the groups obtained as a result of 
the first pass, and different formulas can be used to 
compute variance of individual groups, complexity 
of the overall algorithm can vary widely. 

While this representation is highly parallelizable 
at the second and third passes, its overall paralleliz- 
ability is dependent upon the formula used to find 
variance of individual groups. Note that this rep¬ 
resentation is designed for combining variances of 
different groups and is agnostic to the representa¬ 
tion used for individual groups. While we have used 
Updating at the group-level in our implementation, 
it can be replaced by others. 

Computing mean of individual groups is a well- 
researched subject with Tian et al. 12 providing a 
good overview. We use a single pass algorithm to 
compute mean of individual groups and to combine 
means of groups as well. To handle a large num¬ 
ber of groups, one can look into using an aggrega¬ 
tion tree to combine means. The usual technique of 
mean estimation can be used in case the number of 
groups is large, at the cost of decreased precision. 

There does not appear to be a theoretically ideal 
group size for Total Variance, and we could not 
determine one experimentally either (Section 4.5). 
One natural way of setting group sizes, in distributed 
execution, is to consider data across different nodes 
as individual groups. Further, data within a node 
can be partitioned into equal-sized subgroups, so 
that each core works on a single subgroup. 


2.2 Properties of Different Representations 

While Chan et al. provide an overview of the 
accuracy, passes, and storage required for most of 
the formulas given in Table (other than Total 


Variance), their classification as being distributive, 
and thus the ability to be parallelized, has not been 
explicitly listed before, which we do. In Table 
the Storage column depicts the extra space needed 
for computing variance, which is above and beyond 
that needed to store the data itself. 

The accuracy of Shifted One Pass depends on the 
accuracy of the estimate of the mean. Pairwise Up¬ 
dating is the only representation giving accurate re¬ 
sults while being highly parallelizable and requiring 
a single pass. Additionally, as we will see in Sec¬ 
tion 1^ the precision of Total Variance is slightly 
better than that of Updating Pairwise, which typi¬ 
cally has the best precision amongst all single pass 
algorithms. As a side note, amongst the differ¬ 
ent representations. Two Pass, Total Variance and 
Textbook One Pass are the only ones that can be 
represented using a standard SQL query. 

We note that the error bounds for Two Pass are 
derived by Chan et al. [^, and those for Textbook 
One Pass and Updating are provided in HI- i de¬ 
rives error bounds for Shifted One Pass, and conjec¬ 
tures them for Pairwise Updating. Table 2.1 of 
succinctly enumerates them. Note that Kahan sum¬ 
mation [S 12 can help improve their precision. 


2.3 Data Conditioning 

Data shifting and scaling are immensely useful in 
improving accuracy of algorithms [^. For example, 
shifting the data by its mean is the basis for Shifted 
One Pass. Indeed, Chan et al. demonstrate the 
usefulness of shifting by an approximate mean com¬ 
puted using a sample of the data by proving that it 
reduces the bounds of the condition number. 

Further, numerous techniques such as dividing by 
























the mean or using the log function are helpful 
in improving the accuracy. However, along with 
requiring additional computational resources these 
techniques can also worsen the accuracy under mali¬ 
cious datasets , and need careful user supervision. 

2.4 Hybrid Formulae 

It is clear that different implementations can be 
used to find variance of different groups, and com¬ 
bine partial results. Indeed, it has been brought 
to our attention that a commercial system uses the 
Updating-YC formula to compute variance at in¬ 
dividual nodes, and combines them using Pairwise 
Updating formula. Total Variance is a hybrid for¬ 
mula as well, since variance of the groups needs to 
be computed using one of the other representations. 
This provokes an interesting piece of future work - 
choosing different representations at different com¬ 
putation steps, based on factors such as streaming 
data, numerical precision, data partitioning, time 
for first result, number of passes permissible. This 
idea is elaborated upon in Section 

2.5 Current Recommendation Guidelines 

Chan et al. provide detailed recommendation 
guidelines on the use of different variance formu¬ 
las. They recommend usage of Pairwise Updating 
for combining variances across multiple processors 
since it reduces the errors and is massively paral- 
lelizable if extra 0{log{N)) space is available. Fur¬ 
ther, it is also the safest (least precision loss) algo¬ 
rithm to use within each processor, under the con¬ 
straint of a single pass. Two Pass provides the best 
precision amongst all algorithms, but requires two 
passes. Based on insights obtained through pre¬ 
vious work and our experiments, we provide our 
guidelines in Section which are simple and dras¬ 
tically different from the current guidelines. 

2.6 Extensibility to Other Measures 

Standard deviation, standard error, and coeffi¬ 
cient of variation are important statistical measures, 
and are based on variance computation. As a result, 
these measures will be affected by the properties of 
the underlying variance representation. Similarly, 
the properties will also extend to any user-defined 
measure whose variance can be expressed in a closed 
form as a function of the variance of one of the mea¬ 
sure dimensions. For example, for a user-defined 
measure given by a * MG(Agg) + 5, where a and b 
are constants and Agg is a measure dimension, the 
variance of the measure can be given in closed form 
as *VAR1MCE{Agg). Note that obtaining a closed 
form solution to the variance of holistic or complex 
measures is not always possible, with bootstrapping 


being a popular choice for variance estimation [^. 


3. VARIANCE IMPLEMENTATIONS 
IN MODERN DATABASE SYSTEMS 

Given the variety of variance formulas, we now 
survey various open source databases to find out 
which formulas are used by them to compute vari¬ 
ance. Based on our experiments, we also conjecture 
about two closed source databases. Table[2]lists the 
formula used in each database system. 


Database 

Formula 

PostgreSQL 

9.4.4 

Textbook One Pass 

MySQL 5.7 

Updating 

Impala 2.1.5 

Updating Pairwise 

Hive 1.2.1 

Updating Pairwise 

Spark 1.4.1 

Updating Pairwise 

SQLite 

No Variance Support 

System X 

Textbook One-pass (Conjecture) 

System Y 

Higher precision variables (Guess). 
Cannot conjecture about formula 


Table 2: Variance Implementations in Mod¬ 
ern Databases 


PostgreSQL uses Textbook One Pass and is thus 
susceptible to precision loss. MySQL uses Knuth’s 
modification 10 of Welford’s updating formula. There¬ 
fore, it can only process a single additional data 
point, and cannot avail of the possible paralleliza¬ 
tion. Spark 1.4.1 and Impala 2.1.5, on the other 
hand, use a modified version of Updating Pairwise. 

Although the source code for System X is not 
available, we conjecture that it uses Textbook One 
Pass as its precision behavior was similar to that 
of PostgreSQL. System Y was found to have the 
best precision. We hypothesize that it uses higher 
precision variables, but cannot make any conjecture 
about the exact representation. 


4. EXPERIMENTAL ANALYSIS 

Chan et al. have looked at the precision of dif¬ 
ferent algorithms using single precision input. We 
present the precision results using double precision 
input. We also evaluate the precision of Total Vari¬ 
ance. In addition, we look at the precision in the 
variance calculation offered by the different databases. 
We also present the execution times of different al¬ 
gorithms on data sizes up to 100 million tuples. 
The presented results are the average over 100 runs. 
Results from Section |T^ till Section |T6| were per¬ 
formed using Ubuntu 14.04.05 LTS with a 4 core, 
2.4 GHz Intel CPU, with 16 GB RAM, and 256 
GB SSD storage, on a single execution thread. To 













look at the parallelization speedups, which are pos¬ 
sible for some representations, Section [4.7| provides 
multi-threading-based results. 


Dataset: Although numerous benchmarks exist to 
evaluate the accuracy of numerical algorithms, they 
are constrained by the fact that their the dataset 
sizes are quite limited. For example, the biggest 
dataset in the NIST StRD 11 benchmark consists 
of 5000 points. Furthermore, for this dataset, the 
mean is not significantly larger than the standard 
deviation (/r = 4.5348, a = 2.8673). Therefore, in a 
similar vein as Tian et al. [12| , who generated sim¬ 
ulated datasets inspired by NIST StRD, we created 
synthetic datasets of different sizes with double pre¬ 
cision from Uniform{0, 1), with the resulting vari¬ 
ance of These samples were shifted by adding 
values ranging from 10^ to 10^^. 


4.1 Impact of Shift 



Shift Exponent 

Figure 3: Impact of Increasing Shift on Pre¬ 
cision: With increasing shift exponent, all 
representations experience precision loss — 
though some more severely than others. 

Numerical precision was evaluated using varying 
additive shift exponents, over a dataset of size 10000. 
Group size was set to 10 for Total Variance. We 
present our findings in Figure]^ where Y-axis rep¬ 
resents the number of correct decimal digits (non- 
fractional part of the result was 0). We found the 
results to be as expected [^, with Two Pass having 
the best precision, and Textbook One Pass being 
clearly impacted by the increasing shift exponent. 


4.2 Impact of Data Size 

Since precision errors typically accumulate, we 
used datasets of sizes ranging from 10 to 100 mil¬ 
lion. The shift was set at 10^. We can see from 
Figure]^ that as expected, in most cases the preci¬ 
sion worsens with increasing data size. Two Pass 
again outperforms other algorithms. Textbook One 
Pass shows consistently worst precision. 

Counter-intuitively, the precision of Total Vari¬ 
ance and Updating Pairwise was found to increase 
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Figure 4: Impact of Increasing Data Size 
on Precision: Precision generally decreases 
with increasing dataset size, with exceptions 
of Total Variance and Updating Pairwise. 


with the data size exponent from 2 to 6. We are un¬ 
able to conjecture the reason behind this behavior. 
The precision error for Updating Pairwise increases 
as 0{log{n)), while that for others (except Total 
Variance) increases as at least 0{n) [^, where n is 
the data size. Therefore, while we can expect the 
error in Updating Pairwise to not increase at the 
same rate as other algorithms, the error decrease 
is unexpected. In the absence of theoretical error 
bounds for Total Variance, we cannot hypothesize 
about the possible causes for its behavior. To en¬ 
sure there were no irregularities, the experiment was 
repeated multiple times with similar results. 



12345678 

Shift Exponent 

Figure 5: Impact of Shift on Databases: 
Databases follow precision patterns that are 
expected from their variance formulas. 

We look at variance precision for different databases 
under varying additive shifts, for similar datasets, 
which are prone to precision errors. We took efforts 
to ensure different systems have similar data types. 
100 points were chosen from a Uniform(0, 1) dis¬ 
tribution. Figure shows that precision loss follows 
a similar pattern in System X and PostgreSQL. Im- 
pala and MySQL have a similar error prohle as well. 

4.4 Single-Threaded Execution Speed 












Shift 

mantissa{Si) 

mantissa{S 2 ) 

S = Si-S 2 

variance 

None 

0xa7677ed386b82 

0x3f74ce8319d49 

831.5840227247941 

0.08316671894437384 

1 

0xda6elccdb823 

0xd72e874b34ca 

831.5840227215085 

0.08316671894404526 

2 

0x8156ee01176cb 

0x81561elbb6cb4 

831.5840228646994 

0.08316671895836578 

3 

0x2a531c0d87ff3 

0x2a531a6dbd3c9 

831.5840644836426 

0.08316672312067633 

4 

0xdlb557c3f3080 

0xdlb557bd73dc9 

831.5848388671875 

0.08316680056677543 

5 

0x6bcd32f7f2a8c 

0x6bcd32f7e5a78 

832.3125 

0.08323957395739574 

6 

0xlc37a6532f3c2 

0xlc37a6532f25c 

716.0 

0.07160716071607161 

7 

0xbcl6d9663a96c 

0xbcl6d9663a8ef 

16000.0 

1.6001600160016 

8 

0x5afld7c632dda 

0x5afld7c632df7 

-475136.0 

-47.518351835183516 


Table 3: Example of Application of Textbook One Pass 



Figure 6: Single-Threaded Execution Speed: 
Though Two Pass requires 2 passes over 
data, it provides results faster than other al¬ 
gorithms, with exception of Textbook 1 Pass, 
which has the least numerical precision. 

We also looked at the execution time of different 
algorithms with increasing data size. Results with 
lower data sizes have not been presented due to the 
computation taking minimal time. This experiment 
presented us with interesting results. Surprisingly, 
there was no discernible difference in execution time 
between Two Pass and Shifted One Pass. Textbook 
One Pass was the only algorithm that took lesser 
time than Two Pass. We attribute the low execu¬ 
tion time of Two Pass to simplicity of its compu¬ 
tation. Due to superior accuracy, least execution 
time after error-prone Textbook One Pass, and ease 
of implementation and parallelization, we suggest 
that Two Pass should be the algorithm of choice if 
performing two passes over the data is acceptable. 

4.5 Impact of Group Size on Precision 

Since group size is an integral component of our 
Total Variance algorithm, we looked at the effect 
different group sizes have on precision. Figure 
shows that there does not exist any clear relation¬ 
ship between them, though precision increased in a 
majority of cases with increasing group size. Thus, 
there does not appear to be any ideal group size 
from the perspective of precision. We also note that 



Figure 7: Impact of Group Size on Preci¬ 
sion: Increasing group size improves preci¬ 
sion slightly for some data sizes, although 
there does not exist a clear relationship be¬ 
tween precision and group size. 

there did not exist any significant differences in the 
execution time for varying group sizes. 

4.6 Textbook One Pass in Action 

To further illustrate catastrophic cancellation oc¬ 
curring in Textbook One Pass, Table presents the 
corresponding mantissa of the two expressions that 
compose it. We consider a random sample of size 
10000 generated from Uniform{0,l), and shift it 
by exponents ranging from 1 to 7. Note that Text¬ 
book One Pass calculates the sum of squares as 5" = 
S 1 -S 2 , where Si = and S 2 = XiY- 

We can see that an increasing number of bits in the 
mantissa of and S 2 become equal, until all pre¬ 
cision is lost for the shift exponent of 6. 

4.7 Multi-Threaded Execution Speed 

To determine the possible speedups due to paral¬ 
lel execution, the algorithms were parallelized and 
run on an Ubuntu Linux 14.04.1 LTS system with 
a 48 core 2.4 GHz Intel Xeon CPU, with 256 GB 
memory, and a 500 GB disk. With the exception of 
Updating, other representations were able to ben¬ 
efit from parallelism. We can again observe that 
Two Pass has similar execution time as Shifted One 
Pass, with only Textbook One Pass taking lesser 
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Figure 8: Multi-Threaded Execution Speed: 
Updating cannot avail of parallelism, while 
other algorithms can. Two Pass again pro¬ 
vides results quicker than Updating, Updat¬ 
ing Pairwise, and Total Variance. 


time. Thus, in both single-threaded and multi-threaded 
environments, Two Pass performed exceedingly well. 

We note that there were only minor changes in 
precision due to small modifications being added to 
them for parallelization. Further, in a similar fash¬ 
ion as Section 4.5 varying group sizes in the Total 
Variance representation did not result in significant 
difference in precision or execution time. 


5. CONCLUSION & RECOMMENDATIONS 

Floating point precision can cause information 
loss in both data measurement as well as data stor¬ 
age. This problem is further exacerbated to varying 
degrees by different variance calculation formulae. 

Precision issues associated with Textbook One Pass 
have been well documented. However, we have seen 
that databases such as PostgeSQL and likely Sys¬ 
tem X still use it. We recommend from the per¬ 
spective of safety to discontinue its usage. Though 
there might be arguments for its continued usage 
after warning the users in certain scenarios, the ar¬ 
guments against it far outweigh the speedup bene¬ 
fit and its ease of implementation. Although error 
inherently exists in approximate query processing, 
numerical precision errors are easy to eliminate and 
hard to apportion and therefore should be avoided 
whenever possible. Hence, we recommend to the 
designers of databases, and statistics and analytics 
packages, to discontinue its usage. Further, it would 
be wise for users to perform a sanity check using ex¬ 
periments similar to those given in Section |4.1| 

Previous work has recommended Pairwise Updat¬ 
ing from the perspective of precision, speed, and 
parallelizability |^. However, we have seen from 
our experiments of up to 100 million data points, 
that the most accurate algorithm, Two Pass, takes 
lesser time than Updating, Updating Pairwise, and 
Total Variance. Further, it takes around the same 


amount of time as Shifted One Pass, which relies 
on mean estimation. Two Pass is also easy to im¬ 
plement and parallelize. Therefore, in the case that 
performing two passes over the data is ac¬ 
ceptable, Two Pass should be the preferred 
algorithm. Determining whether two passes are 
acceptable, however, is a nuanced decision. When 
the data fits in memory, performing two passes over 
the data is clearly acceptable as all representations 
will incur the identical data read I/O cost. When 
the data cannot fit in memory, summing up the es¬ 
timated I/O and computation times can help deter¬ 
mine whether Two Pass will need the least amount 
of time, in which case it should be chosen. 

In other cases, i.e., whenever Two Pass is 
not estimated to require the least execution 
time, there does not exist a clear winner, due 
to different algorithms having different strengths 
and weaknesses. Updating provides faster results 
at lower precision, compared with Updating Pair¬ 
wise, without needing additional memory. Updating 
Pairwise is parallelizable, whereas Updating is not. 
While Shifted One Pass provides quick results, its 
accuracy is dependent on correctness of the mean 
estimate. Total Variance has good accuracy, al¬ 
though it takes longer to execute, and is dependent 
on the algorithm used to compute group statistics, 
while also needing multiple passes. Hence, there 
does not exist any algorithm that dominates ev¬ 
ery other algorithm, resulting in there not being 
a clear choice. We can thus see that a query plan¬ 
ner that devises hybrid formulas, while taking the 
data distribution, estimated I/O and computation 
costs, and the overall strengths and weaknesses of 
different algorithms into consideration, appears to 
be an important and ideal piece of future work. 
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8. APPENDIX 


8.1 Total Variance Derivation 

Suppose the dataset D contains N data points, 
with the i*^ data point having the value Xi. Sample 
variance of the entire dataset can be given by u = 
Iv^T ~ xYi and the sample mean by x = 

^ consist of K separate groups, 
with the i*^ group Di consisting of Ui data points. 
The mean, mi, and variance, Vi, of the group 
can then be given respectively by 


rrii = 


1 

m 



^ J 

jeDi 


Vi = 



Y, {Xj - 

jeDi 


The sample variance of D can then be broken up as 
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